Author Feedback
We thank all the reviewers for the time and expertise invested in these reviews.

A: We are sorry that some abuse of notation in the paper hinders the understanding of our method.

A: Such an assumption comes from an empirical observation that, in robotics control problems, some key poses remain alike across different dynamics.
A Theoretical Derivations
A brief proof is provided as follows.

B Implementation Details

Here, we describe certain implementation details of TEEN. For the recurrent optimization mentioned in Section 4.2, we set the period of
We provide the explicit parameters used in our algorithm in Table 1. For the reproduction of TD3, we use the official implementation (https://github.com/sfujim/TD3).

Table 1: Hyperparameters.

Batch size                            256
Discount (γ)                          0.99
Number of hidden layers               2
Number of hidden units per layer      256
Activation function                   ReLU
Iterations per time step              1
Target smoothing coefficient (η)      5·10⁻³
Variance of target policy smoothing   0.2
Noise clip range                      [-0.5, 0.5]
Target critic update interval         2

C Additional Experimental Results

The bolded line represents the average evaluation over 5 seeds.
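The TD3-specific entries in the hyperparameter list above (target smoothing coefficient η, target policy smoothing with variance 0.2, noise clip range [-0.5, 0.5]) can be illustrated with a minimal NumPy sketch. This is our own illustrative code, not the official TD3 implementation linked above; the function names and array-based parameter representation are assumptions for exposition.

```python
import numpy as np

ETA = 5e-3        # target smoothing coefficient (η) from Table 1
SMOOTH_STD = 0.2  # scale of target policy smoothing noise (listed as 0.2)
NOISE_CLIP = 0.5  # noise clip range [-0.5, 0.5]


def soft_update(target_params, online_params, eta=ETA):
    """Polyak averaging of target networks: target <- (1 - eta)*target + eta*online."""
    return [(1.0 - eta) * t + eta * o for t, o in zip(target_params, online_params)]


def smoothed_target_action(target_action, rng, max_action=1.0):
    """TD3 target policy smoothing: add clipped Gaussian noise to the
    target policy's action, then clip to the valid action range."""
    noise = np.clip(
        rng.normal(0.0, SMOOTH_STD, size=target_action.shape),
        -NOISE_CLIP, NOISE_CLIP,
    )
    return np.clip(target_action + noise, -max_action, max_action)
```

With the listed values, each soft update moves the target parameters 0.5% of the way toward the online parameters, and the smoothing noise on target actions is never larger than 0.5 in magnitude.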
Responses to Review #
We thank all the reviewers for the time and expertise invested in these reviews.

Q: What is the meaning of each notation?
A: The corresponding lowercase letters refer to one instance in the set, e.g.

Q: What is the relationship to other transfer learning / imitation learning methods?

Since no major flaws were pointed out in the review, could the reviewer please consider raising the overall score?